Methods and computer program products for the quality control of nucleic acid assays.



Technical Field 

The field of the invention relates to methods and computer program products for the
control of assays for the analysis of nucleic acid within DNA samples.

Background Art 

A fundamental goal of genomic research is the application of basic research into the
sequence and functioning of the genome to improve healthcare and disease
management. The application of novel disease or disease treatment markers to
clinical and/or diagnostic settings often requires the adaptation of suitable research
techniques to large scale high throughput formats. Such techniques include the use of
large scale sequencing, mRNA analysis and in particular nucleic acid microarrays.
DNA microarrays are one of the most popular technologies in molecular biology
today. They are routinely used for the parallel observation of the mRNA expression
of thousands of genes and have enabled the development of novel means of marker
identification, tissue classification, and discovery of new tissue subtypes. Recently it
has been shown that microarrays can also be used to detect DNA methylation and
that results are comparable to mRNA expression analysis, see for example P.
Adorjan et al. Tumour class prediction and discovery by microarray-based DNA
methylation analysis. Nucleic Acids Research, 30(5), 2002, and T. Golub et al.
Molecular classification of cancer: Class discovery and class prediction by gene
expression monitoring. Science, 286:531-537, 1999.

Despite the popularity of microarray technology, there remain serious problems
regarding measurement accuracy and reproducibility. Considerable effort has been
put into the understanding and correction of effects such as background noise, signal
noise on a slide and different dye efficiencies, see for example C. S. Brown et al.
Image metrics in the statistical analysis of DNA microarray data. Proc Natl Acad
Sci USA, 98(16):8944-8949, July 2001 and G. C. Tseng et al. Issues in cDNA
microarray analysis: Quality filtering, channel normalization, models of variations
and assessment of gene effects. Nucleic Acids Research, 29(12):2549-2557, 2001.

However, with the exception of overall intensity normalization (A. Zien et al.
Centralization: A new method for the normalization of gene expression data, Proc.
ISMB VI / Bioinformatics, 17(6):323-331, 2001), it is not clear how to handle
variations between single slides and systematic alterations between slide batches.

Between slide variations are particularly problematic because it is difficult to
explicitly model the numerous different process factors which may distort the
measurements. Some examples are concentration and amount of spotted probe during
array fabrication, the amount of labeled target added to the slide and the general
conditions during hybridization. Other common but often neglected problems are
handling errors such as accidental exchange of different probes during array
fabrication. These effects can randomly affect single slides or whole slide batches.
The latter is especially dangerous because it introduces a systematic error and can
lead to false biological conclusions.

There are several ways to reduce between slide variance and systematic errors.
Removing obvious outlier chips based on visual inspection is an easy and effective
way to increase experimental robustness. A more costly alternative is to do repeated
chip experiments for every single biological sample and obtain a robust estimate for
the average signal. With or without chip repetitions, randomized block design can
further increase certainty of biological findings. Unfortunately, there are several
problems with this approach. Outliers cannot always be detected visually and it is
not feasible to make enough chip repetitions to obtain a fully randomized block
design for all potentially important process parameters. However, when experiments
are standardized enough, process dependent alterations are relatively rare events.
Therefore instead of reducing these effects by repetitions one should rather detect
problematic slides or slide batches and repeat only those. This can only be achieved
by controlling process stability.

Maintaining and controlling data quality is a key problem in high throughput analysis
systems. The data quality is often hampered by experiment to experiment variability
introduced by environmental conditions that may be difficult to control.
Examples of such variables include variability in sample preparation and
uncontrollable reaction conditions. For example, in the case of microarray analysis,
systematic changes in experimental conditions across multiple chips can seriously
affect quality and even lead to false biological conclusions. Traditionally the
influence of these effects has been minimized by expensive repeated measurements,
because a detailed understanding of all process relevant parameters appears to be an
unreasonable burden.

Process stability control is well known in many areas of industrial production where
multivariate statistical process control (MVSPC) is used routinely to detect
significant deviations from normal working conditions. The major tool of MVSPC is
the T² control chart, which is a multivariate generalization of the popular univariate
Shewhart control procedure.

See for example U.S. Patent number 5,693,440. In this application Hotelling's T² in
combination with a simple PCA was used as a means of process verification in
photographic processes. Although this application demonstrates the use of simple
principal component analysis, the benefits of this are not obvious as the data set was
not of a high dimensionality as is often encountered in biotechnological assays such
as sequencing and microarray analysis. Furthermore, this application recommends
the application of PCA on the "cleared" reference data set, which may hide
variations caused by the data set to be monitored.

The application of MVSPC for statistical quality control of microarray and high
throughput sequencing experiments is not straightforward. This is because most of
the relevant process parameters of a microarray experiment cannot be measured
routinely in a high throughput environment.

5-methylcytosine is the most frequent covalent base modification of the DNA of
eukaryotic cells. Cytosine methylation only occurs in the context of CpG
dinucleotides. It plays a role, for example, in the regulation of transcription, in
genetic imprinting, and in tumorigenesis. Methylation is a particularly relevant layer
of genomic information because it plays an important role in expression regulation
(K. D. Robertson et al. DNA methylation in health and disease. Nature Reviews
Genetics, 1:11-19, 2000). Methylation analysis has therefore the same potential
applications as mRNA expression analysis or proteomics. In particular DNA
methylation appears to play a key role in imprinting associated disease and cancer
(see for example, Zeschnigk M, Schmitz B, Dittrich B, Buiting K, Horsthemke B,
Doerfler W. "Imprinted segments in the human genome: different DNA methylation
patterns in the Prader-Willi/Angelman syndrome region as determined by the
genomic sequencing method" Hum Mol Genet 1997 Mar;6(3):387-95 and Peter A.
Jones "Cancer. Death and methylation". Nature 2001 Jan 11;409(6817):141, 143-4).
The link between cytosine methylation and cancer has already been established and it
appears that cytosine methylation has the potential to be a significant and useful
clinical diagnostic marker.

The application of molecular biological techniques in the field of methylation
analysis has hitherto been limited to research applications; to date methylation is not a
commercially utilised clinical marker. The application of methylation disease
markers to a large scale analysis format suitable for clinical, diagnostic and research
purposes requires the implementation and adaptation of high throughput techniques
in the field of molecular biology to the specific constraints and demands of
methylation analysis. Preferred techniques for such analyses include the analysis of
bisulfite treated sample DNA by means of microarray technologies, and real time
PCR based methods such as MethyLight and HeavyMethyl.

Disclosure of Invention

Brief description

The described invention provides a novel method and computer program
products for the process control of assays for the analysis of nucleic acid within
DNA samples. The method enables the estimation of the quality of an individual
assay based on the distribution of the measurements of variables associated with
said assay in comparison to a reference data set. As these measurements are
extremely high dimensional and contain outliers, the application of standard
MVSPC methods is prohibited. In a particularly preferred embodiment of the
method a robust version of principal component analysis is used to detect outliers
and reduce data dimensionality. This step enables the improved application of
multivariate statistical process control techniques. In a particularly preferred
embodiment of the method, the T² control chart is utilised to monitor process
relevant parameters. This can be used to improve the assay process itself, limits
necessary repetitions to affected samples only and thereby maintains quality in a
cost effective way.

Detailed description

In the following application the term 'statistical distance' is taken to mean a
distance between data sets, or between a single measurement vector and a data set,
that is calculated with respect to the statistical distribution of one or both data sets.
In the following the term 'robust' when used to describe a statistic or statistical
method is taken to mean a statistic or statistical method that retains its usefulness
even when one or more of its assumptions (e.g. normality, lack of gross errors) is
violated.

The method and computer program products according to the disclosed invention
provide novel means for the verification and controlling of biological assays.
Said method and computer program products may be applied to any means of
detecting nucleic acid variations wherein a large number of variables are
analysed, and/or for controlling experiments wherein a large number of variables
influence the quality of the experimental data. Said method is therefore
applicable to a large number of commonly used assays for the analysis of nucleic
acid variations including, but not limited to, microarray analysis and sequencing,
for example in the fields of mRNA expression analysis, single nucleotide
polymorphism detection and epigenetic analysis.
To date, the automated analysis of nucleic acid variations has been limited by
experiment to experiment variation. Errors or fluctuations in process variables of
the environment within which the assays are carried out can lead to decreased
quality of assays which may ultimately lead to false interpretations of the
experimental results. Furthermore, certain constraints of assay design, most
notably nucleic acid sequence (which affects factors such as cross hybridisation,
background and noise in microarray analysis), may be subject to experiment to
experiment variation, further complicating standard means of assay result analysis
and data interpretation.

One of the factors that complicates the controlling of such high throughput
assays within predetermined parameters is the high dimensionality of the datasets
which are required to be monitored. Therefore, multiple repetitions of each assay
are often carried out in order to minimize the effects of process artefacts in the
interpretation of complex nucleic acid assays. There is therefore a pronounced
need in the art for improved methods of ensuring the quality of high throughput
genomic assays.

In one embodiment, the method and computer program products according to the
invention provide a means for the improved detection of assay results which are
unsuitable for data interpretation. The disclosed method provides a means of
identifying said unsuitable experiments, or batches of experiments, said identified
experiments thereupon being excluded from subsequent data analysis. In an
alternative embodiment said identified experiments may be further analysed to
identify specific operating parameters of the process used to carry out the assay.
Said parameters may then be monitored to bring the quality of subsequent
experiments within predetermined quality limits. The method and computer
program products according to the invention thereby decrease the requirement for
repetition of assays as a standard means of quality control. The method according
to the invention further provides a means of increasing the accuracy of data
interpretation by identifying experiments unsuitable for data analysis.
In the following it is particularly preferred that all herein described elements of
the method are implemented by means of a computer.
The aim of the invention is achieved by means of a method of verifying and
controlling nucleic acid analysis assays using statistical process control, and/or
computer program products used for said purpose. The statistical process
control may be either multivariate statistical process control or univariate
statistical process control. The suitability of each method will be apparent to one
skilled in the art. The method according to the invention is characterized in that
variables of each experiment are monitored, for each experiment the statistical
distance of said variables from a reference data set (also herein referred to as a
historical data set) is calculated, and wherein a deviation is beyond a
predetermined limit said experiment is indicated as unsuitable for further
interpretation. It is particularly preferred that the method according to the
invention is implemented by means of a computer.

In a preferred embodiment this method is used for the controlling and verification
of assays used for the determination of cytosine methylation patterns within
nucleic acids. In a particularly preferred embodiment the method is applied to
those assays suitable for a high throughput format, for example but not limited to,
sequencing and microarray analysis of bisulphite treated nucleic acids.

In one embodiment, the method according to the invention comprises four steps.
In the first step a reference data set (also herein referred to as a historical data set)
is defined, said data set consisting of all the variables that are to be monitored
and controlled. In the second step a test data set is defined. Said test data set
consists of the experiment or experiments that are to be controlled, and wherein
each experiment is defined according to the values of the variables to be
analysed.

In the third step of the method the statistical distance between the reference and
test data sets or elements or subsets thereof is determined. In the fourth step of
the method individual elements or subsets of the test data set which have a
statistical distance larger than that of a predetermined value are identified.
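The four steps can be illustrated by a minimal sketch. It assumes, purely for illustration, that each experiment is a list of numeric variables and that the statistical distance is the largest robust z-score over all variables (median and median absolute deviation). The function names, the toy values and the limit of 3.5 are hypothetical and are not prescribed by the method.

```python
from statistics import median

def robust_distance(x, reference):
    """Largest robust z-score of x over all variables (assumed distance)."""
    worst = 0.0
    for j in range(len(x)):
        column = [row[j] for row in reference]
        center = median(column)
        mad = median(abs(v - center) for v in column)
        scale = 1.4826 * mad or 1.0  # fall back to 1.0 for a constant column
        worst = max(worst, abs(x[j] - center) / scale)
    return worst

def flag_experiments(reference, test, limit=3.5):
    """Steps 3 and 4: score every test experiment, return indices beyond limit."""
    scores = [robust_distance(x, reference) for x in test]
    flagged = [i for i, s in enumerate(scores) if s > limit]
    return flagged, scores

# Step 1: reference (historical) data set, one row per experiment (toy values).
reference = [[10.0, 0.50], [11.0, 0.52], [9.0, 0.48], [10.5, 0.51], [9.5, 0.49]]
# Step 2: test data set described by the same variables (toy values).
test = [[10.2, 0.50], [25.0, 0.90]]
flagged, scores = flag_experiments(reference, test)
print(flagged)
```

Any of the multivariate distances named in this description could be substituted for `robust_distance` without changing the overall structure.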
In a particularly preferred embodiment of the method, subsequent to the
definition of the reference and test data sets the method comprises a further step,
hereinafter referred to as step 2ii). Said step comprises reducing the data
dimensionality of the reference and test data set by means of robust embedding of
the values into a lower dimensional representation. The embedding space may be
calculated by using one or both of the reference and the test data set. It is
particularly preferred that the data dimensionality reduction is carried out by
means of principal component analysis. In one embodiment of the method step
2ii) comprises the following steps. In the first step the data set is projected by
means of robust principal component analysis. In the second step outliers are
removed from the data set according to their statistical distances calculated by
means of one or more methods taken from the group consisting of: Hotelling's T²
distance; percentiles of the empirical distribution of the reference data set;
percentiles of a kernel density estimate of the distribution of the reference data
set; and distance from the hyperplane of a nu-SVM (see Schölkopf, Bernhard and
Smola, Alex J. and Williamson, Robert C. and Bartlett, Peter L., New Support
Vector Algorithms. Neural Computation, Vol. 12, 2000.), estimating the support
of the distribution of the reference data set. In the third step the embedding
projection is calculated by means of standard principal component analysis and
the cleared or the complete data set is projected onto this basis vector system.
In one embodiment of the method at least one of the variables measured in the
first and second steps is determined according to the methylation state of the nucleic acids.
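The three stages of step 2ii) (robust projection, outlier removal, standard principal component analysis on the cleared data) can be sketched as follows. This is a simplified illustration under stated assumptions: median centring followed by a singular value decomposition stands in for a genuinely robust principal component analysis, the outlier distance is a sum of squared median/MAD z-scores in the projected space, and `n_components` and `cutoff` are hypothetical parameters.

```python
import numpy as np

def robust_embedding(data, n_components=2, cutoff=25.0):
    """Three-stage embedding sketch; returns (scores, keep_mask)."""
    data = np.asarray(data, dtype=float)
    # Stage 1 (assumption): median centring + SVD as a stand-in for robust PCA.
    centred = data - np.median(data, axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    proj = centred @ vt[:n_components].T
    # Stage 2: robust distance in the projected space (median/MAD per axis).
    med = np.median(proj, axis=0)
    mad = 1.4826 * np.median(np.abs(proj - med), axis=0)
    mad[mad == 0] = 1.0
    dist = np.sum(((proj - med) / mad) ** 2, axis=1)
    keep = dist <= cutoff
    # Stage 3: standard PCA on the cleared data, then project the complete set.
    mean = data[keep].mean(axis=0)
    _, _, vt_clean = np.linalg.svd(data[keep] - mean, full_matrices=False)
    scores = (data - mean) @ vt_clean[:n_components].T
    return scores, keep
```

Projecting the complete data set in the last stage keeps the removed outliers visible in the embedded space, so that their distance from the cleared reference data can still be evaluated.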

In a further preferred embodiment of the method at least one of the variables
measured in the first and second steps is determined by the environment used to
conduct the assay. Wherein the assay is a microarray analysis it is further
preferred that these variables are independent of the arrangement of the
oligonucleotides on the array. In a particularly preferred embodiment said
variables are selected from the group comprising mean background/baseline
values; scatter of the background/baseline values; scatter of the foreground
values; geometrical properties of the array; percentiles of background values of
each spot; and positive and negative assay control measures.

In a particularly preferred embodiment wherein the assay is a microarray based assay
said variables are selected from the group comprising mean background/baseline
intensity values; scatter of the background/baseline intensity values; coefficient of
variation for background spot intensities; statistical characterisation of the
distribution of the background/baseline intensity values (1%, 5%, 10%, 25%, 50%,
75%, 90%, 95%, 99% percentiles, skewness, kurtosis); scatter of the foreground
intensity values; coefficient of variation for foreground spot intensities; statistical
characterisation of the distribution of the foreground intensity values (1%, 5%, 10%,
25%, 50%, 75%, 90%, 95%, 99% percentiles, skewness, kurtosis); saturation of the
foreground intensity values; ratio of mean to median foreground intensity values;
geometrical properties of the array as in the gradient of background intensity values
calculated across a set of consecutive rows or columns along a given direction; mean
spot diameter values; scatter of spot diameter values; percentiles of spot diameter
value distribution across the microarray; and positive and negative assay control
measures.
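Several of the distribution characteristics listed above can be computed from a vector of per-spot intensity values as in the following sketch. The function name `summarise` and the dictionary keys are hypothetical; skewness and kurtosis are taken as standardised third and fourth moments, which is one common convention among several.

```python
import numpy as np

def summarise(values):
    """Percentiles, coefficient of variation, skewness, kurtosis, mean/median."""
    v = np.asarray(values, dtype=float)
    mean, sd = v.mean(), v.std(ddof=1)
    z = (v - mean) / v.std()  # population sd for the moment ratios
    pct = [1, 5, 10, 25, 50, 75, 90, 95, 99]
    return {
        "percentiles": dict(zip(pct, np.percentile(v, pct))),
        "cv": sd / mean,                      # coefficient of variation
        "skewness": float(np.mean(z ** 3)),   # standardised third moment
        "kurtosis": float(np.mean(z ** 4)),   # standardised fourth moment
        "mean_to_median": mean / np.median(v),
    }
```

Applied once to the background values and once to the foreground values of a microarray, such a summary yields a fixed-length measurement vector per experiment that does not depend on the arrangement of the oligonucleotides.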

When selecting appropriate variables for the analysis an important criterion is that
the statistical distribution of these variables does not change significantly between
different series of experiments (wherein each series of experiments is defined as a
large series of measurements carried out within one time period and with the same
assay design). This allows the utilisation of measurements from previous studies as
reference data sets.

Wherein the assay is a microarray based assay it is preferred that the variables to be
analysed include at least one variable that refers to each of the foreground,
background, geometrical properties and saturation of the microarray. A particularly
preferred set of variables is as follows:



• Background

1. 75% quantile of all observed values of the percentage of background pixels
per spot above the mean signal + one standard deviation

2. 75% quantile of all observed values of the percentage of background
pixels per spot above the mean signal + two standard deviations

3. skewness of the distribution of observed values of the median background
intensity per spot

4. mean value of the ratio of observed values: mean background intensity
divided by median background intensity per spot

• Geometry

1. 75% quantile of all observed values of the difference between the averaged
background intensities of four consecutive rows and those of the following four
consecutive rows

2. same as in 1. for columns

• Spot Characteristic

1. 95% quantile of all observed spot diameters

2. median (50% quantile) of all observed spot diameters

3. 75% quantile of the ratio of observed values defined by: standard deviation of
foreground intensity per spot divided by mean of foreground intensity per
spot

4. median of the ratio of all observed values defined by: mean foreground
intensity per spot divided by median foreground intensity per spot

• Saturation

1. 95% quantile of foreground intensity pixel saturation percentage per spot

For each variable or group thereof the further steps of the method are according to
the described method. Therefore, in one embodiment of the method first calculate the
statistical distance of each variable from the reference data set. It is preferred that the
reference data set is composed of a large set of previous measurements obtained
under similar experimental conditions. Then combine variables within each
category either by embedding into a 1-dimensional space or by averaging single
values.

Preferably, both the statistical distance and the embedding are carried out in a robust
way.
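The per-variable distance and the combination of variables within each category by averaging can be sketched as below. The robust z-score (median and median absolute deviation) is an assumed choice of robust distance, and the variable names used in the usage example are hypothetical labels, not identifiers defined by the method.

```python
from statistics import median

def robust_z(value, history):
    """Robust univariate distance of a value from historical values (assumed)."""
    center = median(history)
    mad = median(abs(h - center) for h in history)
    return abs(value - center) / (1.4826 * mad or 1.0)

def category_scores(measurement, reference, categories):
    """measurement: {variable: value}; reference: {variable: [history]};
    categories: {category: [variables]}. Returns {category: mean distance}."""
    out = {}
    for name, variables in categories.items():
        d = [robust_z(measurement[v], reference[v]) for v in variables]
        out[name] = sum(d) / len(d)  # combine by averaging single values
    return out

# Hypothetical variable labels and toy values.
categories = {"background": ["bg_skew", "bg_ratio"], "saturation": ["sat_q95"]}
reference = {
    "bg_skew": [0.10, 0.20, 0.15, 0.12, 0.18],
    "bg_ratio": [1.00, 1.10, 1.05, 0.95, 1.02],
    "sat_q95": [0.0, 1.0, 2.0, 1.0, 0.5],
}
measurement = {"bg_skew": 0.16, "bg_ratio": 1.03, "sat_q95": 30.0}
print(category_scores(measurement, reference, categories))
```

Reporting one score per category points the operator to the affected aspect of the experiment (here saturation) rather than to a single aggregate number.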

In a further preferred embodiment, to calculate the quality of the experiment first
calculate a lower dimensional embedding of both the reference and the test data set.
It is preferred that the reference data set that is used is composed of a large set of
previous measurements that are obtained under similar experimental conditions.
Secondly, calculate the statistical distance in this reduced dimensional space. Use
this statistical distance as the quality score.

It will be obvious to one skilled in the art that it is not necessary that the second step of
the method is temporally subsequent to the first step of the method. The reference
data set may be defined subsequent to the test data set, alternatively it may be
defined concurrently with the test data set. In one embodiment of the method the
reference data set may consist of all experiments run in a series wherein said series is
user defined. To give one example, where a microarray assay is applied to a series of
tissue samples the measured variables of all the samples may be included in said
reference data set, however analyses of the same tissue set using an alternative array
may not. Accordingly the test data set may be a subset of or identical to the
reference data set. In another embodiment of the method the reference data set
consists of experiments that were carried out independent or separate from those of
the test data set. The two data sets may be differentiated by factors such as but not
limited to time of production, operator (human or machine), environment used to
carry out the experiment (for example, but not limited to temperature, reagents used
and concentrations thereof, temporal factors and nucleic acid sequence variations).
In a further embodiment of the method the reference data set is derived from a
set of experiments wherein the value of each analysed variable of each
experiment is either within predetermined limits or, alternatively, said variables
are controlled in an optimal manner.

In step 4 of the method the statistical distance may be calculated by means of one or
more methods taken from the group consisting of: the Hotelling's T² distance
between a single test measurement vector and the reference data set; the
Hotelling's T² distance between a subset of the test data set and the reference data
set; the distance between the covariance matrix of a subset of the test data set
and the covariance matrix of the reference set; percentiles of the empirical
distribution of the reference data set; percentiles of a kernel density estimate
of the distribution of the reference data set; and distance from the hyperplane of a
nu-SVM (see Schölkopf, Bernhard and Smola, Alex J. and Williamson, Robert C. and
Bartlett, Peter L., New Support Vector Algorithms. Neural Computation, Vol. 12,
2000.), estimating the support of the distribution of the reference data set.
Wherein Hotelling's T² distance between a single test measurement vector and
the reference data set is measured, it is preferred that the distance is
calculated by using the sample estimate for mean and variance or any robust
estimate for location, including trimmed mean, median, Tukey's biweight, L1-
median, Oja-median, minimum volume ellipsoid estimator and S-estimator (see
Hendrik P. Lopuhaa and Peter J. Rousseeuw: Breakdown points of affine
equivariant estimators of multivariate location and covariance matrices) and any
robust estimate for scale including Median Absolute Deviation, interquantile
range, Qn-estimator, minimum volume ellipsoid estimator and S-estimator.
In a particularly preferred embodiment this is defined as:

$$T^2(i) = (m_i - \mu)' \, S^{-1} \, (m_i - \mu)$$

wherein the reference set mean is

$$\mu = \frac{1}{N_C} \sum_{i=1}^{N_C} m_i$$

and the reference set sample covariance matrix is

$$S = \frac{1}{N_C - 1} \sum_{i=1}^{N_C} (m_i - \mu)(m_i - \mu)'$$

wherein $N_C$ is the number of experiments in the reference set and $m_i$ is the
ith measurement vector of the reference or test data set.
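A minimal numerical sketch of the Hotelling's T² distance between a single measurement vector and the reference data set, using the sample mean and the sample covariance matrix (divisor N_C - 1), is given below. The function name `hotelling_t2` is hypothetical, and the sketch assumes that the reference covariance matrix is non-singular, which in practice usually requires the dimensionality reduction described earlier.

```python
import numpy as np

def hotelling_t2(m, reference):
    """T2 distance of measurement vector m from the reference data set."""
    reference = np.asarray(reference, dtype=float)
    mu = reference.mean(axis=0)                          # reference set mean
    s = np.atleast_2d(np.cov(reference, rowvar=False))   # divides by N_C - 1
    diff = np.asarray(m, dtype=float) - mu
    # Solve S x = diff instead of forming the explicit inverse of S.
    return float(diff @ np.linalg.solve(s, diff))

reference = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
print(hotelling_t2([3.0, 1.0], reference))
```

A robust variant would replace `mu` and `s` by robust estimates of location and scale, such as those listed above.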
Wherein the Hotelling's T² distance is calculated between a subset of the test data
set and the reference data set, it is preferred that the distance is calculated by using the
sample estimate for mean and variance or any robust estimate for location,
including trimmed mean, median, Tukey's biweight, L1-median, Oja-median and
any robust estimate for scale including Median Absolute Deviation, interquantile
range, Qn-estimator, minimum volume ellipsoid estimator and S-estimator. In a
particularly preferred embodiment this is defined as:

$$T^2(i) = (\mu_{HDS} - \mu_{CDS})' \, S^{-1} \, (\mu_{HDS} - \mu_{CDS})$$

Wherein 'HDS' refers to the historical data set, also referred to herein as the
reference data set, and 'CDS' refers to the current data set, also referred to herein
as the test data set. Furthermore, S is calculated from the sample covariance
matrices $S_{HDS}$ and $S_{CDS}$:

$$S = \frac{(N_{HDS} - 1)\, S_{HDS} + (N_{CDS} - 1)\, S_{CDS}}{N_{HDS} + N_{CDS} - 2}$$
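The distance between a subset of the test data set (CDS) and the reference data set (HDS) with a pooled covariance matrix can be sketched as follows. The sketch implements the difference of the two sample means weighted by the inverse pooled covariance, without any additional sample size factor; the classical two-sample statistic additionally multiplies by N_HDS·N_CDS/(N_HDS + N_CDS). The function name `pooled_t2` is hypothetical.

```python
import numpy as np

def pooled_t2(hds, cds):
    """T2 distance between the means of HDS and CDS with pooled covariance."""
    hds = np.asarray(hds, dtype=float)
    cds = np.asarray(cds, dtype=float)
    n_h, n_c = len(hds), len(cds)
    s_h = np.atleast_2d(np.cov(hds, rowvar=False))
    s_c = np.atleast_2d(np.cov(cds, rowvar=False))
    # Pooled covariance weighted by the degrees of freedom of each set.
    s = ((n_h - 1) * s_h + (n_c - 1) * s_c) / (n_h + n_c - 2)
    diff = hds.mean(axis=0) - cds.mean(axis=0)
    return float(diff @ np.linalg.solve(s, diff))
```

Evaluating this distance over a sliding window of consecutive experiments is one way to detect a shift affecting a whole batch rather than a single experiment.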

Wherein the statistical distance is calculated as the distance between the
covariance matrix of a subset of the test data set and the covariance matrix of
the reference set, it is preferred that the test statistics of the likelihood ratio test
for different covariance matrices are included. See for example Hartung J. and
Elpelt B: Multivariate Statistik. R. Oldenbourg, München, Wien, 1995. In a
particularly preferred embodiment this is defined as:

$$L(i) = 2 \left[ \ln|S| - \frac{N_{HDS} - 1}{N_{HDS} + N_{CDS} - 2} \ln|S_{HDS}| - \frac{N_{CDS} - 1}{N_{HDS} + N_{CDS} - 2} \ln|S_{CDS}| \right]$$
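A sketch of a likelihood ratio type statistic for differing covariance matrices is given below. It assumes the form L = 2[ln|S| - w_HDS ln|S_HDS| - w_CDS ln|S_CDS|], with S the pooled covariance matrix and weights w = (N - 1)/(N_HDS + N_CDS - 2); this form and the function name `covariance_lr` are assumptions made for illustration.

```python
import numpy as np

def log_det(a):
    """Natural logarithm of the determinant of a positive definite matrix."""
    return np.linalg.slogdet(a)[1]

def covariance_lr(hds, cds):
    """Likelihood ratio type distance between the covariances of HDS and CDS."""
    hds = np.asarray(hds, dtype=float)
    cds = np.asarray(cds, dtype=float)
    n_h, n_c = len(hds), len(cds)
    dof = n_h + n_c - 2
    s_h = np.atleast_2d(np.cov(hds, rowvar=False))
    s_c = np.atleast_2d(np.cov(cds, rowvar=False))
    s = ((n_h - 1) * s_h + (n_c - 1) * s_c) / dof  # pooled covariance
    return float(2.0 * (log_det(s)
                        - (n_h - 1) / dof * log_det(s_h)
                        - (n_c - 1) / dof * log_det(s_c)))
```

The statistic is zero when both sample covariance matrices coincide and grows as they diverge, so it responds to changes in scatter even when the means of the two data sets agree.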



In a further embodiment of the method, subsequent to steps 1 to 4, the method
may further comprise a fifth step. In a first embodiment of the method said
identified experiments or batches thereof are further interrogated to identify
specific operating parameters of the process used to carry out the assay that may
be required to be monitored to bring the quality of the assays within
predetermined quality limits. In one embodiment of the method this is enabled by
means of verifying the influence of each individual variable by computing its
univariate T² distances between reference and test data set. In a further
embodiment one may analyse the orthogonalized T² distance by computing the
PCA embedding of step 2ii) based on the reference data set. The principal
component responsible for the largest part of the T² distance of an out of control
test data point may then be identified. Responsible individual variables can be
identified by their weights in this principal component. In a further embodiment
variables responsible for the out of control situation can be identified by
backward selection. A subset of variables or single variables can be excluded
from the statistical distance calculation and one can observe whether the
computed distance gets significantly smaller. Wherein the computed statistical
distance significantly decreases one can conclude that the excluded variables
were at least partially responsible for the observed out of control situation.
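Backward selection can be sketched by recomputing the distance with each variable left out in turn. The sketch assumes the Hotelling's T² distance of a single measurement vector as the statistical distance and drops one variable at a time; the function names and the toy reference data are hypothetical.

```python
import itertools
import numpy as np

def t2(m, reference):
    """Hotelling T2 distance of m from the reference data (sample estimates)."""
    mu = reference.mean(axis=0)
    s = np.atleast_2d(np.cov(reference, rowvar=False))
    diff = m - mu
    return float(diff @ np.linalg.solve(s, diff))

def backward_selection(m, reference):
    """Return (full distance, {dropped variable index: reduced distance})."""
    m = np.asarray(m, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n_vars = reference.shape[1]
    reduced = {}
    for j in range(n_vars):
        keep = [k for k in range(n_vars) if k != j]  # exclude variable j
        reduced[j] = t2(m[keep], reference[:, keep])
    return t2(m, reference), reduced

# Toy reference data: the eight corners of a cube (hypothetical values).
reference = list(itertools.product([0.0, 2.0], repeat=3))
full, reduced = backward_selection([1.0, 1.0, 9.0], reference)
culprit = min(reduced, key=reduced.get)  # variable whose removal helps most
print(full, culprit)
```

The variable whose exclusion produces the largest drop in distance is the first candidate for the process parameter that went out of control.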

In a further embodiment, said identified assays are designated as unsuitable for
data interpretation, the experiment(s) are excluded from data interpretation, and
are preferably repeated until identified as having a statistical distance within the
predetermined limit.

In a particularly preferred embodiment, the method further comprises the
generation of a document comprising said elements or subsets of the test data
determined to be outliers. In a further embodiment said document further
comprises the contribution of individual variables to the determined statistical
distance. It is preferred that said document be generated in a readable manner,
either to the user of the computer program or by means of a computer, and
wherein said computer readable document further comprises a graphical user
interface.

Said document may be generated by any means standard in the art, however, it is
particularly preferred that the document is automatically generated by computer
implemented means, and that the document is accessible in a computer readable
format (e.g. HTML, portable document format (pdf), postscript (ps)) and variants
thereof. It is further preferred that the document be made available on a server
enabling simultaneous access by multiple individuals.

In another aspect of the invention computer program products are provided. An
exemplary computer program product comprises:

a) a computer code that receives as input a reference data set 

b) a computer code that receives as input a test data set 

c) a computer code that determines the statistical distance between the reference 
data set and test data set or elements or subsets thereof 

d) a computer code that identifies individual elements or subsets of the test 
dataset which have a statistical distance larger than that of a predetermined value 

e) a computer readable medium that stores the computer code. 

It is further preferred that said computer program product comprises a computer
code for the reduction of the data dimensionality of the reference and test data set
by means of robust embedding of the values into a lower dimensional
representation.



In a preferred embodiment the computer program product further comprises a
computer code that reduces the data dimensionality of the reference and test data
set by means of robust embedding of the values into a lower dimensional
representation. In this embodiment of the invention the embedding space may be
calculated using one or both of the reference and the test data sets. In one
particularly preferred embodiment the computer code carries out the data
dimensionality reduction step by means of a method comprising the following
steps:

i) Projecting the data set by means of robust principal component analysis

ii) Removing outliers from the data set according to their statistical distances
calculated by means of one or more methods taken from the group consisting of:
Hotelling's $T^2$ distance; percentiles of the empirical distribution of the
reference data set; percentiles of a kernel density estimate of the distribution
of the reference data set; and distance from the hyperplane of a nu-SVM,
estimating the support of the distribution of the reference data set

iii) Calculating the embedding projection by standard principal component
analysis and projecting the cleared or the complete data set onto this basis
vector system.

In a further preferred embodiment the computer program product further
comprises a computer code that generates a document comprising said elements
or subsets of the test data identified by the computer code of step d). It is
preferred that said document be generated in a readable manner, either to the user
of the computer program or by means of a computer, and wherein said computer
readable document further comprises a graphical user interface.

Examples 

Example 1 

In this example the method according to the invention is used to control the
analysis of methylation patterns by means of nucleic acid microarrays.

In order to measure the methylation state of different CpG dinucleotides by
hybridization, sample DNA is bisulphite treated to convert all unmethylated
cytosines to uracil; this treatment is not effective upon methylated cytosines and
they are consequently conserved. Genes are then amplified by PCR using
fluorescently labelled primers; in the amplificate nucleic acids unmethylated CpG
dinucleotides are represented as TG dinucleotides and methylated CpG sites are
conserved as CG dinucleotides. Pairs of PCR primers are multiplexed and designed
to hybridise to DNA segments containing no CpG dinucleotides. This allows
unbiased amplification of multiple alleles in a single reaction. All PCR products
from each individual sample are then mixed and hybridized to glass slides carrying
a pair of immobilised oligonucleotides for each CpG position to be analysed. Each
of these detection oligonucleotides is designed to hybridize to the bisulphite
converted sequence around a specific CpG site which is either originally
unmethylated (TG) or methylated (CG). Hybridization conditions are selected to
allow the detection of the single nucleotide differences between the TG and CG
variants.

In the following, $N_{CpG}$ is the number of measured CpG positions per slide,
$N_S$ is the number of biological samples in the study and $N_C$ is the number of
hybridized chips in the study. For a specific CpG position
$k \in \{1,\ldots,N_{CpG}\}$, the frequency of methylated alleles in sample
$j \in \{1,\ldots,N_S\}$, hybridized onto chip $i \in \{1,\ldots,N_C\}$, can then
be quantified as the log ratio (Equation 1):

$$m_{ik} = \log\frac{CG_{ik}}{TG_{ik}}$$

where $CG_{ik}$ and $TG_{ik}$ are the corresponding hybridization intensities.
This ratio is invariant to the overall intensity of the particular hybridization
experiment and therefore gives a natural normalization of our data.
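
The normalization property can be illustrated with a minimal sketch. It assumes
that the methylation measure is the natural-log ratio of the CG and TG
intensities, as suggested by the later references to log ratios; the function
names and the intensity values are hypothetical.

```python
import math

def methylation_log_ratio(cg: float, tg: float) -> float:
    """Log ratio of the CG (methylated) to the TG (unmethylated) intensity."""
    if cg <= 0 or tg <= 0:
        raise ValueError("intensities must be positive")
    return math.log(cg / tg)

def methylation_profile(cg_row, tg_row):
    """Profile (m_i1, ..., m_iN) of one chip from its paired intensities."""
    return [methylation_log_ratio(c, t) for c, t in zip(cg_row, tg_row)]

# Hypothetical intensities for three CpG positions on one chip.
cg = [1200.0, 300.0, 800.0]
tg = [400.0, 900.0, 800.0]
profile = methylation_profile(cg, tg)

# Doubling every intensity (a brighter hybridization) leaves the profile unchanged.
brighter = methylation_profile([2 * c for c in cg], [2 * t for t in tg])
```

Because a common intensity factor cancels in the ratio, two chips of different
overall brightness yield the same profile.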

Here we will refer to a single hybridization experiment $i$ as experiment or
chip. The resulting set of measurement values is the methylation profile
$m_i = (m_{i1},\ldots,m_{iN_{CpG}})'$. We usually have several repeated
hybridization experiments $i$ for every single sample $j$. The methylation profile
for a sample $j$ is estimated from its set of repetitions $R_j$ by the
$L_1$-median as $m_j = \arg\min_x \sum_{i \in R_j} \|m_i - x\|_2$. In contrast to
the simple component wise median this gives a robust estimate of the methylation
profile that is invariant to orthogonal linear transformations such as PCA.
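
A minimal sketch of such a robust sample profile follows. The text does not say
how the minimisation is carried out; Weiszfeld's iteration is used here as one
common choice, and the replicate values are hypothetical.

```python
import math

def l1_median(points, iterations=200, eps=1e-12):
    """Approximate the L1 (spatial) median with Weiszfeld's iteration."""
    dim = len(points[0])
    # Start from the component-wise mean.
    x = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(iterations):
        # Each point is weighted by the inverse of its distance to the estimate.
        weights = [1.0 / max(math.dist(p, x), eps) for p in points]
        total = sum(weights)
        new_x = [sum(w * p[k] for w, p in zip(weights, points)) / total
                 for k in range(dim)]
        if math.dist(new_x, x) < 1e-10:
            return new_x
        x = new_x
    return x

# Four consistent replicates and one grossly deviating chip (hypothetical values).
replicates = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.05, 0.05], [8.0, 8.0]]
center = l1_median(replicates)
mean = [sum(p[k] for p in replicates) / len(replicates) for k in range(2)]
```

The arithmetic mean is pulled towards the deviating chip, whereas the L1 median
stays with the four consistent replicates.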

Data sets 

In our analysis we used data from three microarray studies. In each study the
methylation status of about 200 different CpG dinucleotide positions from promoters,
intronic and coding sequences of 64 genes was measured.

Temperature Control: Our first set of 207 chips came from a control experiment
where PCR amplificates of DNA from the peripheral blood of 15 patients diagnosed
with ALL or AML were hybridized at 4 different temperatures (38 °C, 42 °C, 44 °C,
46 °C). We will use this data set to prove that our method can reliably detect
shifts in experimental conditions.

Lymphoma: The second data set with an overall number of 647 chips came from a
study where the methylation status of different subtypes of non-Hodgkin lymphomas
from 68 patients was analyzed. All chips underwent a visual quality control,
resulting in quality classification as "good" (proper spots and low background),
"acceptable" (no obvious defects but uneven spots, high background or weak
hybridization signals) and "unacceptable" (obvious defects). We will use this data
set to identify different types of outliers and show how our methods detect them.

In addition we simulated an accidental exchange of oligo probes during slide
fabrication in order to demonstrate that such an effect can be detected by our
method. The exchange was simulated in silico by permuting 12 randomly selected CpG
positions on 200 of the chips (corresponding to an accidental rotation of a 24 well
oligo supply plate during preparation for spotting).
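
The in silico exchange can be sketched as follows. The text does not state which
chips were affected or which permutation was used; this sketch assumes the last
`n_affected` chips and a cyclic shift of the selected positions, and all names
and the toy data are hypothetical.

```python
import random

def simulate_probe_exchange(chips, n_positions=12, n_affected=200, seed=0):
    """Return a copy of `chips` with a probe exchange on the last `n_affected` chips.

    The same permutation of `n_positions` randomly chosen CpG positions is
    applied to every affected chip, as if probes had been swapped on the slide.
    """
    rng = random.Random(seed)
    selected = rng.sample(range(len(chips[0])), n_positions)
    # A cyclic shift guarantees that every selected position really moves.
    shifted = selected[1:] + selected[:1]
    result = [list(chip) for chip in chips]
    for chip in result[len(result) - n_affected:]:
        original = list(chip)
        for src, dst in zip(selected, shifted):
            chip[dst] = original[src]
    return result, sorted(selected)

# Toy data: 10 chips with 20 positions each, 3 affected chips, 4 exchanged positions.
chips = [[float(k) for k in range(20)] for _ in range(10)]
exchanged, positions = simulate_probe_exchange(chips, n_positions=4, n_affected=3)
```

Only the selected positions of the affected chips change; each affected chip
still contains the same set of values, merely at different positions.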

ALL/AML: Finally we show data from a second study on ALL and AML,
containing 468 chips from 74 different patients. During the course of this study 46
oligomers had to be re-synthesized, some of which showed a significant change in
hybridization behavior, due to synthesis quality problems. We will demonstrate how
our algorithm successfully detected this systematic change in experimental
conditions.

Typical artefacts 



Typical artefacts in microarray based methylation analysis are shown in Figure 1.
The plots show the correlation between single or averaged methylation profiles.
Every point corresponds to a single CpG position; the axis values are log ratios.
a) A normal chip, showing good correlation to the sample average. b) A chip
classified as "unacceptable" by visual inspection. Many spots showed no signal,
resulting in a log ratio of 0. c) A chip classified as "good". Hybridization
conditions were not stringent enough, resulting in saturation. In many cases pairs
of CG and TG oligos showed nearly identical high signals, giving a log ratio around
0. d) A chip classified as "acceptable". Hybridization signals were weak compared
to the background intensity, resulting in a high amount of noise. e) Comparison of
group averages over all 64 ALL/AML chips hybridized at 42 °C and all 48 ALL/AML
chips hybridized at 44 °C. f) Comparison of group averages over 447 regular chips
from the lymphoma data set and the 200 chips with a simulated accidental probe
exchange during slide production, affecting 12 CpG positions.

With a high number of replications for each biological sample and the corresponding
average m being reliably estimated, outlier chips can be relatively easily detected
by their strong deviation from the robust sample average. In the following, we will
discuss some typical outlier situations, using data from the Lymphoma experiment.
In this case the hybridization of each sample was repeated at a very high
redundancy of 9 chips.

After identifying possible error sources the question remains how to reliably detect
them, in particular if they cannot be avoided with absolute certainty. One aim of the
invention is therefore to exclude single outlier chips from the analysis and to detect
systematic changes in experimental conditions as early as possible in order to
facilitate a fast recalibration of the production process.



Detecting Outlier Chips with Robust PCA
Methods

As a first step we want to detect single outlier chips. In contrast to standard
statistical approaches based on image features of single slides we will use the
overall distribution of the whole experimental series. This is motivated by the fact
that although image analysis algorithms will successfully detect bad hybridization
signals, they will usually fail in cases of unspecific hybridization. The aim is to
identify the region in measurement space where most of the chips $m_i$,
$i = 1,\ldots,N_C$, are located. The region will be defined by its center and an
upper limit for the distance between a single chip and the region center. Chips
with deviations higher than the upper limit will be regarded as outliers.

A simple approach is to independently define for every CpG position $k$ the
deviation from the center $\mu_k$ as

$$t_k = \frac{m_{ik} - \mu_k}{s_k}$$

hereinafter referred to as Equation 3, where $\mu_k = \frac{1}{N}\sum_i m_{ik}$ is
the mean and $s_k^2 = \frac{1}{N-1}\sum_i (m_{ik} - \mu_k)^2$ is the sample
variance over all chips. Assuming that the $m_{ik}$ are normally distributed,
$t_k$ multiplied by a constant follows a t-distribution with $N-1$ degrees of
freedom. This can be used to define the upper limit of the admissible region for a
given significance level $\alpha$.

However, a separate treatment of the different CpG positions is only optimal when
their measurement values are independent. As Fig. 2 demonstrates, it is important
to take into account the correlation between different dimensions. It is possible
that a point which is not detected as an outlier by a component wise test is in
reality an outlier (e.g. P1 in Fig. 2). On the other hand, there are points that
will be erroneously detected as outliers by a component wise test (e.g. P2 in
Fig. 2). Because microarray data usually have a very high correlation, it is better
to use a multivariate distance concept instead of the simple univariate
$t_k$-distance. A natural generalization of the $t_k$-distance is given by
Hotelling's $T^2$ statistic, defined as Equation 4:

$$T^2(i) = (m_i - \mu)' S^{-1} (m_i - \mu)$$

with sample mean $\mu = \frac{1}{N_C}\sum_{i=1}^{N_C} m_i$ and sample covariance
$S = \frac{1}{N_C - 1}\sum_{i=1}^{N_C} (m_i - \mu)(m_i - \mu)'$.

Assuming that the $m_i$ are multivariate normally distributed, $T^2$ multiplied by
a constant follows an F-distribution with $N_{CpG}$ and $N_C - N_{CpG}$ degrees of
freedom. This can be used to define the upper limit of the admissible region for a
given significance level $\alpha$.
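
The difference between the component wise and the multivariate distance can be
illustrated in two dimensions. The following minimal sketch uses the explicit
inverse of a 2x2 covariance matrix; the data points and function names are
hypothetical and only mimic the situation of P1 in Fig. 2.

```python
def mean_and_cov_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def hotelling_t2_2d(point, mean, cov):
    """(x - mean)' S^-1 (x - mean) using the explicit 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def univariate_scores(point, mean, cov):
    """Absolute deviation per coordinate in units of its standard deviation."""
    sxx, _, syy = cov
    return (abs(point[0] - mean[0]) / sxx ** 0.5,
            abs(point[1] - mean[1]) / syy ** 0.5)

# Two strongly correlated "CpG positions" (hypothetical values).
points = [(-2 + 0.5 * i, -2 + 0.5 * i + 0.1 * (-1) ** i) for i in range(9)]
mean, cov = mean_and_cov_2d(points)

p1 = (1.0, -1.0)     # unremarkable in each coordinate, but off the correlation axis
inline = (2.0, 2.0)  # further from the centre, but along the correlation axis
t2_p1 = hotelling_t2_2d(p1, mean, cov)
t2_inline = hotelling_t2_2d(inline, mean, cov)
```

Here `p1` lies within one standard deviation in each coordinate, yet its $T^2$
value is far larger than that of `inline`, which a component wise test would
rate as the more extreme point.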

Two problems arise when we want to use the $T^2$-distance for microarray data:

1. For fewer chips $N_C$ than measurements $N_{CpG}$, the sample covariance
matrix $S$ is singular and not invertible.

2. The estimates for $\mu$ and $S$ are not robust against outliers.

The first problem can be addressed by using principal component analysis (PCA) to
reduce the dimensionality of our measurement space. This is done by projecting all
methylation profiles $m_i$ onto the first $d$ eigenvectors with the highest
variance. As a result we get the $d$-dimensional centered vectors
$\hat{m}_i = P_{PCA}(m_i - \mu)$ in eigenvector space. After the projection, the
covariance matrix $\hat{S} = \mathrm{diag}(\lambda_1,\ldots,\lambda_d)$ of the
reduced space is a diagonal matrix and the $T^2$ distance of Equation 4 is
approximated by the $T^2$-distance in the reduced space:

$$T^2(i) \approx \sum_{k=1}^{d} \frac{\hat{m}_{ik}^2}{\lambda_k}$$

Under the assumption that the true variance is equal to $\lambda_k$, $T^2$ follows
a $\chi^2$ distribution with $d$ degrees of freedom. This can be used to define the
upper limit of the admissible region for a given significance level $\alpha$.
However the problem remains that the estimated eigenvectors and variances
$\lambda_k$ are not robust against outliers.

We propose to solve the problem of outlier sensitivity together with the dimension
reduction step by using robust principal component analysis (rPCA). rPCA finds the
first $d$ directions with the largest scale in data space, robustly approximating
the first $d$ eigenvectors. The algorithm starts with centering the data with a
robust location estimator. Here we will use the $L_1$ median according to
Equation 6:

$$\mu_{L_1} = \arg\min_{\mu} \sum_{i=1}^{N_C} \|m_i - \mu\|_2$$

In contrast to the simple component-wise median, this gives a robust estimate of the
distribution center that is invariant to orthogonal linear transformations such as
PCA. Then all centered observations are projected onto a finite subset of all
possible directions in measurement space. The direction with maximum robust scale
is chosen as an approximation of the largest eigenvector (e.g. by using a robust
scale estimator). After projecting the data into the orthogonal subspace of the
selected "eigenvector" the procedure searches for an approximation of the next
eigenvector. Here the finite set of possible directions is simply chosen as the set
of centered observations themselves.
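
The projection pursuit described above can be sketched as follows. This is a
minimal illustration, not the implementation used in the study: the observations
are assumed to be already centered, and the median absolute deviation (MAD) is
used as the robust scale estimator, which is an assumption because the estimator
named in the original text is not legible. Function names and data are
hypothetical.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation, a robust scale estimate."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def robust_pca(centered, d):
    """Projection-pursuit rPCA on already centered observations.

    Candidate directions are the normalised observations themselves. The
    direction with the largest robust scale is kept, the data are projected
    into its orthogonal complement, and the search is repeated.
    """
    residual = [list(row) for row in centered]
    directions, scales = [], []
    for _ in range(d):
        best_dir, best_scale = None, -1.0
        for row in residual:
            norm = math.sqrt(sum(v * v for v in row))
            if norm < 1e-12:
                continue  # a (numerically) zero vector defines no direction
            cand = [v / norm for v in row]
            proj = [sum(a * b for a, b in zip(r, cand)) for r in residual]
            scale = mad(proj)
            if scale > best_scale:
                best_dir, best_scale = cand, scale
        if best_dir is None:
            break
        directions.append(best_dir)
        scales.append(best_scale)
        # Deflate: remove the component along the chosen direction.
        for r in residual:
            c = sum(a * b for a, b in zip(r, best_dir))
            for k in range(len(r)):
                r[k] -= c * best_dir[k]
    return directions, scales

# Seven observations along the diagonal plus one gross outlier (hypothetical).
line = [(x, x + 0.1 * (-1) ** k) for k, x in enumerate(range(-3, 4))]
data = line + [(6.0, -6.0)]
directions, scales = robust_pca(data, d=2)
```

In this toy example the first robust direction follows the diagonal along which
the regular observations spread, and the single outlier at (6, -6) has little
influence on the choice.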

After obtaining the robust projection of our data into a $d$-dimensional subspace
we can compute the upper limit of the admissible region $T^2_{UCL}$, also referred
to as the upper control limit (UCL). For a given significance level $\alpha$ it is
computed as Equation 7:

$$T^2_{UCL} = \chi^2_{d,1-\alpha}$$

Every observation $m_i$ with $T^2(i) > T^2_{UCL}$ is regarded as an outlier.
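
A minimal sketch of this outlier test in the reduced space follows. To stay
within the Python standard library, the chi-square quantile is obtained with the
Wilson-Hilferty approximation rather than an exact routine; the scores and
variances are hypothetical, while d = 10 and a significance level of 0.025 are
the values quoted for Figure 3.

```python
from statistics import NormalDist

def chi2_quantile_wh(df, prob):
    """Wilson-Hilferty approximation to the chi-square quantile."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

def t2_reduced(scores, variances):
    """T^2 distance of one chip in the d-dimensional (robust) PCA space."""
    return sum(s * s / v for s, v in zip(scores, variances))

d, alpha = 10, 0.025
ucl = chi2_quantile_wh(d, 1.0 - alpha)  # upper control limit

variances = [4.0] * d                   # hypothetical variances per direction
typical = [2.0] * d                     # one standard deviation on every axis
extreme = [2.0] * 9 + [10.0]            # five standard deviations on one axis

t2_typical = t2_reduced(typical, variances)
t2_extreme = t2_reduced(extreme, variances)
is_outlier = t2_extreme > ucl
```

Where an exact quantile is preferred, a statistics library can replace the
approximation without changing the rest of the sketch.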
Results 

In order to test how the rPCA algorithm works on microarray data we applied it to
the Lymphoma dataset and compared its performance to classical PCA. The results
are shown in Figure 3.

The rPCA algorithm detected 97% of the chips with "unacceptable" quality,
whereas classical PCA only detected 29%. 10% of the "acceptable" chips were
detected as outliers by rPCA, whereas PCA detected 3%. rPCA detected 21 chips
as outliers which were classified as "good". These chips have all been confirmed to
show saturated hybridization signals, not identified by visual inspection. This means
rPCA is able to detect nearly all cases of outlier chips identified by visual
inspection. Additionally rPCA detects microarrays which have inconspicuous image
quality but show an unusual hybridization pattern.

An obvious concern with this use of rPCA for outlier detection is that it relies on
the assumption of normal distribution of the data. If the distribution of the
biological data is highly multi-modal, biological subclasses may be wrongly
classified as outliers. To quantify this effect we simulated a very strong cluster
structure in the Lymphoma data by shifting one of the smaller subclasses by a
multiple of the standard deviation. Only when the measurements of all 174 CpG of
the subclass were shifted by more than 2 standard deviations was a considerable
part of the biological samples wrongly classified as outliers. In order to avoid
such a misclassification, we tolerate at most 50% of repeated measurements of a
single biological sample to be classified as outliers. However, we never reached
this threshold in practice.



Statistical process control
Methods

In the last section we have seen how outliers can be detected solely on the basis of
the overall data distribution. Statistical process control expands this approach by
introducing the concept of time. The aim is to observe the variables of a process for
some time under perfect working conditions. The data collected during this period
form the so called historical data set (HDS), also referred to above as the
'reference data set'. Under the assumption that all variables are normally
distributed, the mean $\mu_{HDS}$ and the sample covariance matrix $S_{HDS}$ of the
historical data set fully describe the statistical behavior of the process.

Given the historical data set it becomes possible to check at any time point $i$
how far the current state of the process has deviated from the perfect state by
computing the $T^2$-distance between the ideal process mean $\mu_{HDS}$ and the
current observation $m_i$. This corresponds to Equation 4 with the overall sample
estimates $\mu$ and $S$ replaced by their reference counterparts $\mu_{HDS}$ and
$S_{HDS}$. Any change in the process will cause observations with greater
$T^2$-distances. To decide whether an observation shows a significant deviation
from the HDS we compute the upper control limit as in Equation 8:

$$T^2_{UCL} = \frac{p(n+1)(n-1)}{n(n-p)} F_{p,\,n-p,\,1-\alpha}$$

where $p$ is the number of observed variables, $n$ is the number of observations in
the HDS, $\alpha$ is the significance level and $F_{p,\,n-p,\,1-\alpha}$ is the
corresponding quantile of the F-distribution with $p$ and $n-p$ degrees of freedom.
Whenever $T^2 > T^2_{UCL}$ is observed the process has to be regarded as
significantly out of control.
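
The scaling of the control limit can be sketched as below. The sketch assumes the
usual form of this limit, p(n+1)(n-1) / (n(n-p)) multiplied by an F quantile with
p and n-p degrees of freedom. The quantile is passed in as an argument because the
Python standard library offers no F-distribution; it could be obtained, for
example, from `scipy.stats.f.ppf(1 - alpha, p, n - p)`. The function name and the
example values are hypothetical.

```python
def hotelling_ucl(p: int, n: int, f_quantile: float) -> float:
    """Upper control limit for a new observation against an HDS of size n.

    p          -- number of observed variables
    n          -- number of observations in the historical data set
    f_quantile -- (1 - alpha) quantile of the F-distribution with p and n - p
                  degrees of freedom, supplied by the caller
    """
    if n <= p:
        raise ValueError("the HDS must contain more observations than variables")
    return p * (n + 1) * (n - 1) / (n * (n - p)) * f_quantile

# Scaling factor alone, for p = 10 variables and an HDS of n = 75 chips.
factor = hotelling_ucl(10, 75, 1.0)
```

The factor shrinks towards p as the historical data set grows, so a larger HDS
gives a tighter limit for the same significance level.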

In our case the process to control is a microarray experiment and the only process
variables we have observed are the log ratios of the actual hybridization
intensities. A single observation is then a chip $m_i$ and the HDS of size
$N_{HDS}$ is defined as $\{m_1,\ldots,m_{N_{HDS}}\}$. We have to be aware of a few
important issues in this interpretation of statistical process control. First, our
data has a multi-modal distribution which results from a mixture of different
biological samples and classes. Therefore the assumption of normality is only a
rough approximation and $T^2_{UCL}$ from Equation 8 should be regarded with
caution. Secondly, as we have seen in the last sections, microarray experiments
produce outliers, resulting in transgression of the UCL. This means sporadic
violations of the UCL are normal and do not indicate that the process is out of
control. The third issue is that we have to use the assumption that a microarray
study will not systematically change its data generating distribution over time.
Therefore the experimental design has to be randomized or block randomized,
otherwise a systematic change in the correctly measured biological data will be
interpreted as an out of control situation (e.g. when all patients with the same
disease subtype are measured in one block). Finally, the question remains of what
time means in the context of a microarray experiment. Beside the biological
variation in the data, there are a multitude of different parameters which can
systematically alter the final hybridization intensities. The experimental series
should stay constant with regard to all of them. In our experience the best initial
choice is to order the chips by their date of hybridization, which shows a very high
correlation to most parameters of interest.

Although it is certainly interesting to look how single hybridization experiments
$m_i$ compare to the HDS, we are more interested in how the general behavior of the
chip process changes over time. Therefore we define the current data set (CDS)
(also referred to above as the test data set) as
$\{m_{i-N_{CDS}/2},\ldots,m_i,\ldots,m_{i+N_{CDS}/2}\}$, where $i$ is the time of
interest. This allows us to look at the data distribution in a time interval of
size $N_{CDS}$ around $i$. In analogy to the classical setting in statistical
process control we can define the $T^2$-distance between the HDS and the CDS as in
Equation 9:

$$T_w^2(i) = (\mu_{HDS} - \mu_{CDS})' S^{-1} (\mu_{HDS} - \mu_{CDS})$$

where $S$ is calculated from the sample covariance matrices $S_{HDS}$ and
$S_{CDS}$ as Equation 10:

$$S = \frac{(N_{HDS} - 1)\, S_{HDS} + (N_{CDS} - 1)\, S_{CDS}}{N_{HDS} + N_{CDS} - 2}$$

Although it is possible to use the $T_w^2$-distance between the historical and
current data set to test for $\mu_{HDS} = \mu_{CDS}$, this information is
relatively meaningless. The hypothesis that the means of HDS and CDS are equal
would almost always be rejected, due to the high power of the test. What is of more
interest is $T_w = \sqrt{T_w^2}$, which is the amount by which the two sample means
differ in relation to the standard deviation of the data.
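
A minimal sketch of this distance between two data sets follows. It assumes that
the distance is the difference of the two sample means measured in the metric of
the pooled covariance matrix (weights N - 1 for each set) and then square-rooted;
the helper names and the toy data are hypothetical.

```python
import math

def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def covariance(rows):
    """Sample covariance matrix with denominator n - 1."""
    n, mu = len(rows), mean_vector(rows)
    dim = len(mu)
    return [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1)
             for b in range(dim)] for a in range(dim)]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting.

    Assumes a non-singular matrix.
    """
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def t_w_distance(hds, cds):
    """Distance between the HDS and CDS means in pooled standard deviations."""
    n_h, n_c = len(hds), len(cds)
    s_h, s_c = covariance(hds), covariance(cds)
    dim = len(s_h)
    pooled = [[((n_h - 1) * s_h[a][b] + (n_c - 1) * s_c[a][b]) / (n_h + n_c - 2)
               for b in range(dim)] for a in range(dim)]
    diff = [h - c for h, c in zip(mean_vector(hds), mean_vector(cds))]
    t2 = sum(d * x for d, x in zip(diff, solve(pooled, diff)))
    return math.sqrt(max(0.0, t2))

# Toy data: the CDS equals the HDS shifted by 2 along the first axis.
hds = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cds = [[x + 2.0, y] for x, y in hds]
shift = t_w_distance(hds, cds)
```

In this toy example each axis has a standard deviation of about 0.58, so a shift
of 2 corresponds to a distance of about 3.46 standard deviations.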

In order to see whether an observed change of the $T_w$-distance comes from a
simple translation it is also interesting to compare the two sample covariances
$S_{HDS}$ and $S_{CDS}$. A translation in log(CG/TG) space means that the
hybridization intensities of HDS and CDS differ only by a constant factor (e.g. a
change in probe concentration). This situation can be detected by looking at the
statistic $L$, which is the test statistic of the likelihood ratio test for
different covariance matrices. It gives a distance measure between the two
covariance matrices (i.e. $L = 0$ means equal covariances).
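
The formula for this statistic is not legible in the text. As an illustration
only, the following sketch uses the classical log-determinant form of the
likelihood ratio statistic for the equality of two covariance matrices (the
uncorrected statistic of Box's M test), which is zero for identical matrices and
positive otherwise. It is restricted to 2x2 matrices to keep the determinant
explicit; all names and values are hypothetical.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def covariance_distance(s_hds, n_hds, s_cds, n_cds):
    """Log-determinant statistic comparing two 2x2 covariance matrices."""
    dof_h, dof_c = n_hds - 1, n_cds - 1
    pooled = [[(dof_h * s_hds[a][b] + dof_c * s_cds[a][b]) / (dof_h + dof_c)
               for b in range(2)] for a in range(2)]
    return ((dof_h + dof_c) * math.log(det2(pooled))
            - dof_h * math.log(det2(s_hds))
            - dof_c * math.log(det2(s_cds)))

reference = [[1.0, 0.5], [0.5, 1.0]]
changed = [[4.0, 0.0], [0.0, 0.25]]

l_same = covariance_distance(reference, 75, reference, 75)
l_changed = covariance_distance(reference, 75, changed, 75)
```

A pure translation of the process mean leaves both covariance matrices, and hence
this statistic, unchanged, whereas a change in covariance structure increases it.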

Before we can apply the described methods to a real microarray data set we have
again to solve the problem that we need a non-singular and outlier resistant estimate
of $S_{HDS}$ and $S_{CDS}$. What makes the problem even harder is that we cannot a
priori know how a change in experimental conditions will affect our data. In contrast
to the last section, the simple approximation of $S_{HDS}$ by its first principal
components will not work here. The reason is that changes in the experimental
conditions outside the HDS will not necessarily be represented in the first principal
components of $S_{HDS}$.

The solution is to first embed all the experimental data into a lower dimensional
space by PCA. This works, because any significant change in the experimental
conditions will be captured by one of the first principal components. $S_{HDS}$ and
$S_{CDS}$ can then be reliably computed in the lower dimensional embedding. The
problem of robustness is simply solved by first using robust PCA to remove outliers
before performing the actual embedding and before computing the sample
covariances. A summary of our algorithm is:

1. Order chips according to the parameter of interest, e.g. date of hybridisation.

2. Take the set of ordered chips $\{m_1,\ldots,m_{N_C}\}$, remove outliers with
rPCA for computing the first $d$ eigenvectors with classical PCA.

3. Project the set of all ordered chips $\{m_1,\ldots,m_{N_C}\}$ into the
$d$-dimensional subspace spanned by the computed vectors.

4. Select the first $N_{HDS}$ chips $\{m_1,\ldots,m_{N_{HDS}}\}$ as historical data
set, remove outliers with rPCA for computing $\mu_{HDS}$ and $S_{HDS}$.

5. For every time index $i$:

(a) Compute the $T^2$-distance between $m_i$ and $\mu_{HDS}$.

(b) Select the chips $\{m_{i-N_{CDS}/2},\ldots,m_{i+N_{CDS}/2}\}$ as current data
set, remove outliers with rPCA for computing $\mu_{CDS}$ and $S_{CDS}$.

i. Compute the $T_w$-distance between $\mu_{HDS}$ and $\mu_{CDS}$.

ii. Compute the $L$-distance between $S_{HDS}$ and $S_{CDS}$.

6. Generate the control chart by plotting $T^2$, $T_w$ and $L$.

With the computed values for $T^2$, $T_w$ and $L$ we can generate a plot that
visualizes the quality development of the chip process over time, a so called $T^2$
control chart.
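
The sliding-window part of the procedure can be illustrated with a strongly
simplified sketch. It is univariate (one variable per chip instead of a
d-dimensional embedded profile) and omits the rPCA outlier removal and the
covariance comparison, so it only shows how a window of chips around each time
index is compared with the historical data set. Names and data are hypothetical.

```python
from statistics import mean, variance

def control_chart(values, n_hds, n_cds):
    """Univariate chart of the distance between the HDS and a sliding CDS.

    Returns (time index, distance) pairs, the distance being the difference of
    the two means in units of the pooled standard deviation.
    """
    hds = values[:n_hds]
    mu_h, var_h = mean(hds), variance(hds)
    half = n_cds // 2
    chart = []
    for i in range(half, len(values) - half):
        cds = values[i - half:i + half + 1]  # window centred on time index i
        pooled = ((len(hds) - 1) * var_h + (len(cds) - 1) * variance(cds)) / (
            len(hds) + len(cds) - 2)
        chart.append((i, abs(mean(cds) - mu_h) / pooled ** 0.5))
    return chart

# 20 chips under stable conditions, then 20 chips after a process shift of 3.
values = ([0.5 * (-1) ** k for k in range(20)]
          + [3 + 0.5 * (-1) ** k for k in range(20)])
chart = dict(control_chart(values, n_hds=10, n_cds=6))
```

Windows lying entirely before the shift stay close to zero, whereas windows after
the shift are several standard deviations away from the historical data set.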

Results 

The first example is shown in Fig. 4, which demonstrates how our algorithm detects a
change in hybridization temperature. As can be expected the $T^2$-value grows with
an increase in hybridization temperature. The systematic increase of the
$L$-distance indicates that this is not only caused by a simple translation in
methylation space. The process has to be regarded as clearly out of control, due to
the observation that almost all chips are above the UCL after the temperature change
and the process center has drifted more than $T_w = 4$ standard deviations away from
its original location.

Fig. 6 shows how our method detects the simulated handling error in the Lymphoma
data set. The affected chips can be clearly identified by the significant increase in
the $T^2$-distances as well as by their change in the covariance structure.

Finally, Fig. 5 shows the $T^2$ control chart of the ALL/AML study. It clearly
indicates that the experimental conditions significantly changed two times over the
course of the study. A look at the $L$-distance reveals that the covariance within
the two detected artefact blocks is identical to the HDS. A change in covariance can
be detected only when the CDS window passes the two borders. This clearly indicates
that the observed effect is a simple translation of the process mean.

The major practical problem is now to identify the reasons for the changes. In this
regard the most valuable information from the $T^2$ control chart is the time point
of process change. It can be cross-checked with the laboratory protocol and the
process parameters which have changed at the same time can be identified. In our
case the two process shifts corresponded to the time of replacement of
re-synthesized probe oligos for slide production, which were obviously delivered at
a wrong concentration. After exclusion of the affected CpG positions from the
analysis the $T^2$ chart showed normal behavior and the overall noise level of the
data set was significantly reduced.

Discussion 

Taken together, we have shown that robust principal component analysis and
techniques of statistical process control can be used to detect flaws in microarray
experiments. Robust PCA has proven to be able to automatically detect nearly all
cases of outlier chips identified by visual inspection, as well as microarrays with
inconspicuous image quality but saturated hybridization signals. With the $T^2$
control chart we introduced a tool that facilitates the detection and assessment of
even minor systematic changes in large scale microarray studies.

A major advantage of both methods is that they do not rely on an explicit modeling
of the microarray process as they are solely based on the distribution of the actual
measurements. Having successfully applied our methods to the example of DNA
methylation data, we assume that the same results can be achieved with other types
of microarray platforms. The sensitivity of the methods improves with increasing
study sizes, due to their multivariate nature. This makes them particularly suitable
for medium to large scale experiments in a high throughput environment.

The retrospective analysis of a study with our methods can greatly improve results
and avoid misleading biological interpretations. When the $T^2$ control chart is
monitored in real time a given quality level can be maintained in a very cost
effective way. On the one hand, this allows for an immediate correction of process
parameters. On the other hand, this makes it possible to specifically repeat only
those slides affected by a process artefact. This guarantees high quality while
minimizing the number of repetitions.

A general shortcoming of $T^2$ control charts is that they only indicate that
something went wrong, but not what exactly was the source. Therefore we have used
the time at which a significant change happened in order to identify the responsible
process parameter. We have shown how a quantification of the change in covariance
structure provides additional information and makes it possible to discriminate
between different problems like changes in probe concentration and accidental
handling errors.

Example 2 

In one aspect, the method according to the disclosed invention provides a means for
automatically generating a concise report based on the disclosed methods for quality
monitoring of laboratory process performance. In the disclosed embodiment this
report is structured in sections: a summary table (see Table 1) of the performance
grades for several evaluation categories of the individual experimental units; a
section detailing each evaluation category in turn, with a table of grades for this
category, the corresponding performance variables the grades are based on and a set
of graphical displays implemented as a panel of box plots (see Figure 7) displaying
the thresholds used for grading; and a table of details containing all evaluation
grades for each experimental unit. The report can be generated by means of a
computer program which outputs the result in the file formats HTML, Adobe PDF,
PostScript, and variants thereof.

Table 1

Chip                    Vis. Grade  Rob. PCA-Thr.  BG       SPOT  GEO   SAT
0100870030-68406-57115  3           -0.9           bad      good  bad   good
0100870296-68421-57110  2           -1.5           bad      good  bad   good
0100870569-68422-57121  2           -2.7           bad      good  bad   good
0100870907-68447-57105  2           -2             dubious  good  bad   good
0100870949-68451-57127  2           -1.8           dubious  good  bad   good
0100871228-68460-57104  2           -1.9           dubious  good  bad   good
0100871947-68487-57109  1           -1.6           dubious  good  bad   good
0100871997-68491-57128  2           -2.1           bad      good  bad   good
0100872531-68503-57103  6           5.6            bad      good  good  good
0100872549-68495-57112  1           2.3            bad      good  bad   good
0100872573-68504-57129  2           -0.2           bad      good  bad   good
0100872812-68517-57106  2           -1.4           dubious  good  bad   good
0100870056-68408-57133  3           -1.8           bad      good  bad   good
0100870072-68410-57139  3           -2.1           bad      good  bad   good


Table 1 shows the summary table of category grades for each experimental unit.
From left to right, the columns state the identifier of the experimental unit,
the human expert visual grade, the distance of the experimental unit from the
estimate of the robust mean location of the set of experiments, the background
category grade, the spot characteristic category grade, the geometry characteristic
grade and the intensity saturation category grade. Three grade levels are used
(good, dubious, bad), based on the grades calculated for each category in turn.
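
How such a summary table could be produced automatically is sketched below. The
grading thresholds, the assumption that larger values of a performance variable
are worse, and all function names are hypothetical; the text states only that
three grade levels are derived from thresholds and that the report can be output
as HTML.

```python
import html

def grade(value, dubious_at, bad_at):
    """Map a performance variable to a grade (larger values assumed worse)."""
    if value >= bad_at:
        return "bad"
    if value >= dubious_at:
        return "dubious"
    return "good"

def summary_table_html(rows, columns):
    """Render a list of row dictionaries as a plain HTML table."""
    head = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(row[c]))}</td>" for c in columns) + "</tr>"
        for row in rows)
    return f"<table><tr>{head}</tr>{body}</table>"

# Hypothetical background measurements for two chips and hypothetical thresholds.
measurements = {"chip-A": 0.4, "chip-B": 2.6}
rows = [{"Chip": name, "BG": grade(value, dubious_at=1.0, bad_at=2.0)}
        for name, value in measurements.items()]
report = summary_table_html(rows, ["Chip", "BG"])
```

Escaping the cell contents keeps the generated document well formed even if chip
identifiers contain characters that have a special meaning in HTML.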



Table 2 shows the complete summary table of all chips analysed in study '1'
according to Figure 7, of which Table 1 represents the most informative subset.


Table 2

Chip                    Vis. Grade  Rob. PCA-Thr.  BG       SPOT  GEO   SAT
0100870030-68406-57115  3           -0.9           bad      good  bad   good
0100870296-68421-57110  2           -1.5           bad      good  bad   good
0100870569-68422-57121  2           -2.7           bad      good  bad   good
0100870907-68447-57105  2           -2             dubious  good  bad   good
0100870949-68451-57127  2           -1.8           dubious  good  bad   good
0100871228-68460-57104  2           -1.9           dubious  good  bad   good
0100871947-68487-57109  1           -1.6           dubious  good  bad   good
0100871997-68491-57128  2           -2.1           bad      good  bad   good
0100872531-68503-57103  6           5.6            bad      good  good  good
0100872549-68495-57112  1           2.3            bad      good  bad   good
0100872573-68504-57129  2           -0.2           bad      good  bad   good
0100872812-68517-57106  2           -1.4           dubious  good  bad   good
0100870056-68408-57133  3           -1.8           bad      good  bad   good
0100870072-68410-57139  3           -2.1           bad      good  bad   good
0100870098-68412-57145  3           -1.2           good     good  bad   good
0100870171-68417-57183  3           -1.3           good     good  bad   good
0100870402-68426-57164  2           -2.2           dubious  good  bad   good
0100870527-68437-57107  2           -1.6           bad      good  bad   good

Brief Description of Drawings

Figure 1: Typical artefacts in microarray based methylation analysis. The plots show
the correlation between single or averaged hybridisation profiles. 'A' shows a
typical normal chip; the scatter is due to the approximately normally distributed
experimental noise. A typical chip classified as "unacceptable" by visual inspection
is shown in 'B'. Many spots showed no signal, resulting in a log ratio of 0 after
thresholding the signals. The opposite case is shown in 'C'. This chip has very
strong hybridisation signals and was classified as "good" by visual inspection.
However, the hybridisation conditions have been too unspecific and most of the
oligos were saturated. 'D' shows a chip classified as "acceptable". Hybridization
signals were weak compared to background intensity, resulting in a high amount of
noise. 'E' shows the comparison of group averages over 64 chips in a study
hybridised at 42 °C and 48 chips from the same study hybridised at 44 °C. 'F' shows
the comparison of group averages over 447 regular chips from one study and 200 chips
with a simulated accidental probe exchange during slide production affecting 12
positions on the chip.

Figure 2: Comparison between univariate (central rectangle) and multivariate
(ellipse) upper confidence intervals. P1 is not detected as outlier by the
univariate distance, but by the multivariate T²-statistic. P2 is erroneously
detected as outlier by the univariate distance, but not by the multivariate
T²-statistic. For P3 (non-outlier) and P4 (outlier) both methods give the same
decision.

Figure 3: T²-distances of robust PCA versus classical PCA for the Lymphoma
dataset. The UCL values are shown as two dotted lines. Chips to the right of
the vertical line were detected as outliers by robust PCA. Chips above the
horizontal line were detected as outliers by classical PCA. Chips classified as
'unacceptable' by visual inspection are shown as squares, 'acceptable' chips as
triangles and 'good' chips as crosses. Note that 'good' chips detected as
outliers by rPCA have all been confirmed to show saturated hybridization
signals. The UCL values are calculated with [...] = 10 and significance level
α = 0.025.
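
The difference between the univariate and the multivariate decision described
for Figure 2 can be illustrated with a small numerical sketch. The following
Python code is not taken from the disclosure. The reference points, the
coordinates chosen for P1 to P4, the univariate limit of two standard
deviations and the use of a chi-square limit for the T²-distance (for two
variables this limit has the closed form -2 ln α; it is only a large-sample
approximation) are all assumptions made for illustration.

```python
import math

# Hypothetical reference measurements of two strongly correlated variables.
REFERENCE = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 0.9), (2.0, 2.1),
             (-1.5, -1.4), (1.5, 1.4), (0.5, 0.6), (-0.5, -0.6)]

# Hypothetical test points named after the points in Figure 2.
POINTS = {"P1": (1.0, -1.0), "P2": (3.0, 3.0),
          "P3": (0.5, 0.5), "P4": (3.5, -3.0)}


def mean_and_cov(points):
    """Sample mean and 2x2 sample covariance (sxx, sxy, syy) of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def t2_distance(point, mean, cov):
    """T² (squared Mahalanobis) distance of one point from the reference."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form d' S^-1 d written out for the 2x2 case.
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def classify(point, reference=REFERENCE, z=2.0, alpha=0.025):
    """Return (univariate_outlier, multivariate_outlier) for one point."""
    mean, cov = mean_and_cov(reference)
    sxx, _, syy = cov
    # Univariate rule: each variable is checked against its own limits.
    univariate = (abs(point[0] - mean[0]) > z * math.sqrt(sxx)
                  or abs(point[1] - mean[1]) > z * math.sqrt(syy))
    # Assumed limit: chi-square quantile with 2 degrees of freedom.
    t2_limit = -2.0 * math.log(alpha)
    multivariate = t2_distance(point, mean, cov) > t2_limit
    return univariate, multivariate


if __name__ == "__main__":
    for name, point in POINTS.items():
        print(name, classify(point))
```

In this sketch P1 lies inside the univariate rectangle but violates the
correlation between the two variables, so only the T²-distance flags it, while
P2 lies far out along the correlation axis and is flagged only by the
univariate rule.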

Figure 4: T² control chart of ALL/AML study. Over the course of the experiment
a total of 46 oligomers for 35 different CpG positions had to be
re-synthesized. Oligos were replaced at time indices 234 and 315. The upper
plot shows the T²-distance of 433 hybridizations, where the grey curve shows
the running average as computed by a lowess fit. The lower plot shows the [...]
and L-distance between HDS and CDS with a window size of N_HDS = N_CDS = 75.

Figure 5: T² control chart of simulated probe exchange in the Lymphoma data
set. Between chips 300 and 500 an accidental oligo probe exchange during slide
production was simulated by rotating 12 randomly selected CpG positions. The
upper plot shows the T²-distance of all 647 hybridisations, where the line of
the curve shows the running average as computed by a lowess fit. Triangular
points are chips classified as 'unacceptable' by visual inspection. The lower
plot shows the [...] and L-distance between HDS and CDS with a window size of
N_HDS = N_CDS = 75.

Figure 6: T² control chart of temperature experiment. The same ALL/AML samples
were hybridized at 4 different temperatures. The upper plot shows the
T²-distance of all 207 hybridizations to the HDS, where the line of the curve
shows the running average as computed by a lowess fit. The lower plot shows the
[...] and L-distance between HDS and CDS with a window size of
N_HDS = N_CDS = 30.



Figure 7: A set of box plots, wherein the experimental series described
according to Example 2 corresponds to box plot [...]. The variable distribution
summarized is the 75 % quantiles of the standard deviations of the per spot
percentage of pixels that surpass the per spot threshold of one standard
deviation about the mean of all pixel values. The lower horizontal line
displays the 75 % quantile [...] the 95 % quantile of this distribution
calculated from the combined five data sets shown in the individual box plots
[...]. The thus defined thresholds are used for grading the experimental unit
with respect to this single variable.
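
The grading of an experimental unit against the two horizontal threshold lines
can be sketched as follows. This is only an illustration under assumptions: the
function names, the example values and the mapping of the three ranges to the
labels 'good', 'dubious' and 'bad' are hypothetical and are not taken from the
text. Only the use of the 75 % and 95 % quantiles of the combined distribution
as thresholds follows the figure description.

```python
from statistics import quantiles


def quantile_thresholds(combined_values):
    """Return the 75 % and 95 % quantiles of the combined distribution."""
    cuts = quantiles(combined_values, n=100)  # 99 percentile cut points
    return cuts[74], cuts[94]


def grade(value, lower, upper):
    """Grade one experimental unit on a single variable (labels are assumed)."""
    if value <= lower:
        return "good"
    if value <= upper:
        return "dubious"
    return "bad"


# Hypothetical combined values of the variable from five data sets.
combined = [float(v) for v in range(1, 101)]
lower, upper = quantile_thresholds(combined)

# Grade three hypothetical experimental units.
grades = [grade(v, lower, upper) for v in (50.0, 80.0, 99.0)]
print(lower, upper, grades)
```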



